Loan Data Analysis by Eliu Schmitt (mail@eliu.de), Dec. 2016
The data from the peer to peer lending platform Prosper contains 113.937 entries with 81 variables.
The entries represent 113.066 unique credit listings pulled from the platform between November 2005 and March 2014. Double entries were removed from the dataset.
The 81 different variables contain information on the loan, the borrower and the lenders/investors.
Several variables contain information on the loan.
LoanOriginalAmount: The origination amount of the loan.
The amount of the loans range from 1,000 to 35,000 USD. This are probably the maximum and minimum amounts specified by Prosper. The distribution is negatively skewed with very distinct modes at about 4,000, 7,500, 10,000, 15,000, 20,000 and 25,000 USD. People tend to round to these numbers even while the amount they need the loan for are probably not structured like this.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 6300 8315 12000 35000
MonthlyLoanPayment: The scheduled monthly loan payment.
The monthly payment seems to be rather low around the 200 $. There is a very high count of loans with monthly payments of about 170 USD that should be investigated. The distributions seems to be multimodal with local peaks at 100, 170, 300, 500 and 700 USD.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 131.0 217.4 271.9 370.6 2252.0
The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket.
The relative low number of cancelled loans is of no further interest and was removed from the dataset.
The amount of written off (‘Chargedoff’) and defaulted credits compared to the completed are unexpectedly high. About 30 % seem to be not able to pay back the loan.
The following chart shows a more simplified view to compare the successful and defaulted loans, counting “Defaulted” and “Chargedoff” as “Not Successful”, “Completed” as “Successful” and all other as “Open”.
The plot shows a failure rate of about 30 % of all loans (Success rate overall: 0.691).
## # A tibble: 3 × 3
## LoanStatus.Bucket n ratio
## <fctr> <int> <dbl>
## 1 Not successful 17010 0.1504498
## 2 Open 57990 0.5129090
## 3 successful 38061 0.3366413
## # A tibble: 2 × 3
## LoanStatus.Bucket n ratio
## <fctr> <int> <dbl>
## 1 Not successful 17010 0.308874
## 2 successful 38061 0.691126
Term: The length of the loan expressed in months.
The absence of 2 year and 4 year loans is somewhat surprising. The majority were 3 year loans.
##
## 36 60 12
## 87219 24228 1614
Borrower Rate: The Borrower’s interest rate for this loan. LenderYield: The Lender yield on the loan. Lender yield is equal to the interest rate on the loan less the servicing fee.
The LenderYield is a bit lower than the BorrowerRate but follows the same distribution. The distributions seems to be roughly normal except for a distinct peak at ca. 33 %.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1340 0.1840 0.1929 0.2506 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0100 0.1245 0.1740 0.1829 0.2406 0.4925
ProsperScore: A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009.
I treated the Score as a category rather than a continuous value since it is not known how the risk is scaled. It is also not known if 11 is a better score than 10 or if 11 marks loans without a score. Most risk scores fall in the middle range (4 to 8).
The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans
The borrowers mostly specified “Debt Consolidation” as their intended use what is reasonable. Quite some did not specify the purpose. Other notable uses were “Home Improvement” and “Business”.
The Occupation selected by the Borrower at the time they created the listing.
The factor includes 68 unique options. The biggest group was “Other” and another group left the field blank. There is a good chance this was to protect their privacy. If the data is true, some of the occupations are surprising: I would guess that Doctors, Judges, Dentist or other respectable professions would have other - less expensive options - to fund them self.
The employment status of the borrower at the time they posted the listing.
A simplified shows that the unemployed are a minority.
The length in months of the employment status at the time the listing was created.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 26.00 67.00 96.06 137.00 755.00 7621
A Borrower will be classified as a homeowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner.
About every second borrower is a home owner. This might raise the question why they wont use the house to get a bank loan on the house.
Specifies whether or not the Borrower was in a group at the time the listing was created.
The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation.
Only 12,712 borrowers were in one of 707 distinct groups.
## False True
## 100349 12712
## [1] 707
Number of accounts delinquent at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 0.595 0.000 83.000 696
Dollars delinquent at the time the credit profile was pulled.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 986.8 0.0 463900.0 7618
IncomeRange The income range of the borrower at the time the listing was created.
Most borrowers specified an income range. The majority had an annual income between 25,000 USD and 100,000 USD with quite some incomes above 100,000 USD. This was unexpected, since these borrowers would probably also have access to cheaper loan from other sources.
IncomeVerifiable The borrower indicated they have the required documentation to support their income.
Most borrowers marked their income as verifiable.
## False True
## 8586 104475
StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.
The stated monthly income ranges from 0 to nearly 2 Mil. 75 % stated an income of at leas 3,200 USD per month. There are several very distinct peaks in the histogram maybe indication some sort of clusters.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5605 6825 1750000
Recommendations: Number of recommendations the borrower had at the time the listing was created.
Only 4,257 borrowers had Recommendations at all. Maybe the feature wasn’t used that much.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.04838 0.00000 39.00000
## [1] "Without recommendations"
## [1] 108804
## [1] "With recommendations"
## [1] 4257
InvestmentFromFriendsCount: Number of friends that made an investment in the loan.
Only 2,131 borrowers hand friends investing in them through the platform. This might be due to the reason that Prosper takes a few and friends can just borrow money directly skipping the fee.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02364 0.00000 33.00000
## [1] 2131
Investors The number of investors that funded the loan.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 44.00 80.88 116.00 1189.00
There are 113,937 unique loans from the peer to peer lending platform Prosper. The data contains 81 different features about the loan, the borrower and the investors/lenders.
I analyzed the following features:
LoanOriginalAmount Most loans were in a lower range. The median is 6,300 USD and the mean 8,135 USD.
MonthlyLoanPayment Most loan payments are below 370 USD per month (3rd quartile). There is an unexpected spike at 170 USD per month.
Loan Status Factor that shows if the loan is current, if it was payed back or if the borrower did not pay it pack.
Term The length of the loan expressed in months.
LenderYield
Interest for the borrower and the lender per annum. The max is with nearly 50 % very height. The mean and median are with about 18.4 % and 19.3 % are also very high.
Prosper Score A custom risk score built using historical Prosper data. Used as a factor.
ListingCategory The category of the listing that the borrower selected when posting their listing.
Many borrowers did not specify a category. Those who did mostly choose Debt Consolidation.
The most often chosen occupation was “Other”.
EmploymentStatus The employment status of the borrower at the time they posted the listing.
EmploymentStatusDuration The length in months of the employment status at the time the listing was created.
IsBorrowerHomeowner About half of the borrowers specified they were homeowners.
GroupKey Specifies whether or not the Borrower was in a group at the time the listing was created.
The Key of the group in which the Borrower is a member of. Value will be null if the borrower does not have a group affiliation
Only few people were in a group.
CurrentDelinquencies Number of accounts delinquent at the time the credit profile was pulled.
AmountDelinquent Dollars delinquent at the time the credit profile was pulled.
IncomeRage IncomeRange The income range of the borrower at the time the listing was created.
A factor containing the levels “$0” “$1-24,999” “$25,000-49,999” “$50,000-74,999” “$75,000-99,999”
“$100,000+” “Not employed” “Not displayed”
IncomeVerifiable Most borrowers marked their income as verifiable.
StatedMonthlyIncome StatedMonthlyIncome The monthly income the borrower stated at the time the listing was created.
75 % stated an income of at leas 3,200 USD per month.
DeptToIncomeRatio
Recommendations Recommendations: Number of recommendations the borrower had at the time the listing was created.
Only 4,257 borrowers had Recommendations at all.
InvestmentFromFriendsCount Only 2,131 borrowers hand friends investing in them through the platform.
Investors The number of investors that funded the loan.
The listings were pulled from the Prosper database between November 2005 and March 2014. Double entries were removed from the dataset.
The main interest is in these two features:
I try to find out which features can predict if the loan will be payed back and what features influence the yield of the loan.
The income, occupation, employment status will likely influence the success and the yield of the loan.
I created a simplified factor for the loan status only looking at Open, Successful, Not successful. Overdue payments were counted as Open and defaulted and written off were both counted as Not successful.
I created a new factor only distinguishing between employed or not employed.
There were a few hundred double entries that were cleaned. These were inspected and removed.
I’m first looking at the difference between the group of successful loans and the loans that were not payed back. The table shows several differences.
## # A tibble: 2 × 7
## LoanStatus.Bucket mean_stated_income mean_loan_amount mean_borrower_rate
## <fctr> <dbl> <dbl> <dbl>
## 1 Not successful 4450.824 6424.842 0.2317624
## 2 successful 5324.390 6188.146 0.1864263
## # ... with 3 more variables: mean_debt_to_inc_ratio <dbl>,
## # mean_duration_employment <dbl>, n <int>
There is a strong correlation between the success of an loan an the stated monthly income. The difference between the groups is about - 666 USD in income per month.
##
## Wilcoxon rank sum test with continuity correction
##
## data: StatedMonthlyIncome by LoanStatus.Bucket
## W = 272290000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -717.3333 -625.0000
## sample estimates:
## difference in location
## -666.6667
## # A tibble: 2 × 4
## LoanStatus.Bucket mean_income median_income n
## <fctr> <dbl> <dbl> <int>
## 1 Not successful 4450.824 3750.000 17010
## 2 successful 5324.390 4416.667 38061
The amount of the loan follows a similar distribution for the successful and not successful loans. The mean amount for the failed is a bit higher. At an alpha level of 0.05 the difference would still be significant. The effect is very low. This might change looking only at specific parts. All loans above 25,000 USD were always payed back.
## # A tibble: 2 × 5
## LoanStatus.Bucket mean_amount median_amount max_amount min_amount
## <fctr> <dbl> <dbl> <int> <int>
## 1 Not successful 6424.842 4500 25000 1000
## 2 successful 6188.146 4500 35000 1000
##
## Wilcoxon rank sum test with continuity correction
##
## data: LoanOriginalAmount by LoanStatus.Bucket
## W = 327860000, p-value = 0.01594
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 6.215555e-05 1.190726e-05
## sample estimates:
## difference in location
## 6.037318e-05
The short 12 month loans seems the be very save with a success rate of nearly 95 %. The difference between the 3 year and the 5 year loan is relatively low.
## # A tibble: 3 × 3
## Term success_ratio n
## <int> <dbl> <int>
## 1 12 0.9464752 1532
## 2 36 0.6859189 49847
## 3 60 0.6554713 3692
The borrower rate for the successful loans is a lot lower than for the not successful loans. The average rate is 4.95 % lower with the successful loans.
##
## Wilcoxon rank sum test with continuity correction
##
## data: BorrowerRate by LoanStatus.Bucket
## W = 4.3e+08, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## 0.04767575 0.05000017
## sample estimates:
## difference in location
## 0.04948877
As expected, the employed borrowers tend to pay back their loan more often, although the failure rate for the unemployed is still below 50 %.
## # A tibble: 2 × 3
## IsEmployed success_ratio n
## <fctr> <dbl> <int>
## 1 Employed 0.7110333 45417
## 2 Not employed 0.5974726 9654
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(ld.without_current$LoanStatus.Bucket, ld.without_current$IsEmployed)
## X-squared = 480.44, df = 1, p-value < 2.2e-16
The successful loans had more investors on average. This might point to some sort of social network effect, people warning each other or similar.
## # A tibble: 2 × 4
## LoanStatus.Bucket max_investors mean_investors median_investors
## <fctr> <int> <dbl> <dbl>
## 1 Not successful 881 97.45256 60
## 2 successful 1189 108.25094 74
##
## Wilcoxon rank sum test with continuity correction
##
## data: Investors by LoanStatus.Bucket
## W = 297990000, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
## 95 percent confidence interval:
## -9.999995 -7.000002
## sample estimates:
## difference in location
## -8.000034
Loans with investment from friends were more likely to be successful. Most people did not have investment from friends.
## # A tibble: 2 × 4
## LoanStatus.Bucket max_ mean_ median_
## <fctr> <int> <dbl> <dbl>
## 1 Not successful 9 0.03386243 0
## 2 successful 33 0.05160138 0
## # A tibble: 2 × 3
## HasInvestmentFromFriends success_ratio n
## <lgl> <dbl> <int>
## 1 FALSE 0.6884164 53058
## 2 TRUE 0.7625435 2013
As expected, the success rate becomes better with a higher ProsperScore
## # A tibble: 12 × 3
## ProsperScore success_ratio n
## <fctr> <dbl> <int>
## 1 1 0.5562806 613
## 2 2 0.6542923 1293
## 3 3 0.6875399 1565
## 4 4 0.6692308 2730
## 5 5 0.6734694 3283
## 6 6 0.7215318 3943
## 7 7 0.7470644 3151
## 8 8 0.8216659 4514
## 9 9 0.8818040 3215
## 10 10 0.9534180 1653
## 11 11 0.9687500 32
## 12 NA 0.6331029 29079
The borrower could state a purpose for the loan. Only looking at categories with n > 50, Motorcycle was the best. No specification is a risk since these were the least reliable loans. The most common category Debt Restructuring still had a slightly better than average rate.
## # A tibble: 16 × 3
## ListingCategory..names. success_ratio n
## <chr> <dbl> <int>
## 1 Motorcycle 0.9148936 94
## 2 Engagement Ring 0.9047619 63
## 3 Auto 0.7862248 1539
## 4 Other 0.7531457 6040
## 5 Home Improvement 0.7502313 3243
## 6 Taxes 0.7476636 214
## 7 Large Purchases 0.7458564 181
## 8 Student Use 0.7420635 756
## 9 Debt Consolidation 0.7368126 17858
## 10 Vacation 0.7323232 198
## 11 Wedding Loans 0.7309942 171
## 12 Personal Loan 0.6764092 2395
## 13 Business 0.6689380 4256
## 14 Medical/Dental 0.6459459 370
## 15 Household Expenses 0.6214405 597
## 16 Not Available 0.6049681 16948
Only looking at professions who had at least 100 loans, Military Officers seem to be the most reliable.
## # A tibble: 49 × 4
## Occupation success_ratio fail_ratio n
## <fctr> <dbl> <dbl> <int>
## 1 Military Officer 0.9004975 0.09950249 201
## 2 Attorney 0.8868360 0.11316397 433
## 3 Student - College Graduate Student 0.8834951 0.11650485 206
## 4 Scientist 0.8763441 0.12365591 186
## 5 Computer Programmer 0.8311949 0.16880513 2494
## 6 Analyst 0.8165447 0.18345526 1777
## 7 Doctor 0.8122066 0.18779343 213
## 8 Engineer - Chemical 0.8000000 0.20000000 110
## 9 Engineer - Electrical 0.8000000 0.20000000 545
## 10 Student - College Senior 0.7878788 0.21212121 165
## # ... with 39 more rows
The Military Officers also payed less interest.
## [1] "All borrowers:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1350 0.1945 0.2004 0.2699 0.4975
## [1] "Only Military Officers:"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0425 0.0920 0.1500 0.1708 0.2400 0.3500
The homeowners were a bit more likely to pay back the loan but not by much. The p value is still very low so the difference is still significant.
## # A tibble: 2 × 4
## IsBorrowerHomeowner success_ratio fail_ratio n
## <fctr> <dbl> <dbl> <int>
## 1 False 0.6778117 0.3221883 29191
## 2 True 0.7061437 0.2938563 25880
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(ld.without_current$IsBorrowerHomeowner, ld.without_current$LoanStatus.Bucket)
## X-squared = 51.451, df = 1, p-value = 7.342e-13
The interest rate (BorrowerRate) is higher for small loans. This was not expected as the mean amount for the unsuccessful loans was higher than for the successful therefore smaller loans seem to be less risky.
##
## Pearson's product-moment correlation
##
## data: ld$BorrowerRate and ld$LoanOriginalAmount
## t = -116.89, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3335464 -0.3231454
## sample estimates:
## cor
## -0.3283558
Borrowers who stated a higher monthly income payed less interest. The correlations is not very strong.
##
## Pearson's product-moment correlation
##
## data: ld$BorrowerRate and ld$StatedMonthlyIncome
## t = -29.95, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.09450222 -0.08293603
## sample estimates:
## cor
## -0.08872211
In plot shows that the interest rate goes down with higher monthly payments. This is not expected since higher interest should lead to higher monthly payments. In the plot some steep “stripes” are visible that represent the correlation that is expected. These stripes should represent loans of the same amount and duration that just differ by the interest rate.
##
## Pearson's product-moment correlation
##
## data: ld$BorrowerRate and ld$MonthlyLoanPayment
## t = -84.592, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2494531 -0.2384891
## sample estimates:
## cor
## -0.2439789
The following plot shows the relationship between the monthly payment and the amount dependent on the term of the loan. There is a difference in the relationship starting at loans grater than 25,000 USD. The width of the distribution is narrowing. Like in the chart plotting the loan amount towards the interest rate there is less variance after 25,000 USD.
The not successful loans usually had to pay an higher interest rate.
The unemployed had to pay higher interest on their loan.
Doctors pay the lowest interest on average. The Military Officer who had the best reliability only makes 13th place.
## # A tibble: 62 × 3
## Occupation mean_interest n
## <fctr> <dbl> <int>
## 1 Doctor 0.1608967 492
## 2 Pharmacist 0.1640706 255
## 3 Engineer - Chemical 0.1675559 222
## 4 Computer Programmer 0.1680567 4442
## 5 Attorney 0.1684318 1033
## 6 Pilot - Private/Commercial 0.1688667 198
## 7 Engineer - Electrical 0.1692419 1118
## 8 Professor 0.1707583 556
## 9 Scientist 0.1710580 367
## 10 Architect 0.1719108 212
## # ... with 52 more rows
I found a relationship between the success that a loan will be payed back or not with several features. A higher ProsperScore was an indicator, as well as the StatedMonthlyIncome, the EmploymentStatus, the Occupation and the BorrowerRate. The Term (length) of the loan was most interesting for the (few) 1 year loans. The amount of the loan also seems to be not highly connected.
The BorrowerRate (interest rate the borrower had to pay),vary with the ProsperScore, the StatedMonthlyIncome and the EmploymentStatus. The interest is also lower for loans with a higher amount. The rate also varies with the occupation.
The StatedMonthlyIncome and the LoanAmount are correlated. The stated monthly income for Loans greater than 25,000 USD was equal or above 8,333 USD a month. All lower loans also had borrowers without a stated income (or an stated income of 0 USD). Higher loans also mostly stated DeptRestruturing as the main purpose. This is surprising, as a high monthly income should secure other loans as well.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8333 9583 11600 13340 14580 66670
##
## Pearson's product-moment correlation
##
## data: ld$StatedMonthlyIncome and ld$LoanOriginalAmount
## t = 68.854, df = 113060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1950122 0.2062010
## sample estimates:
## cor
## 0.2006131
The strongest relationships are those obviously needed by the business model. Higher loans correspond with higher monthly payments, as well as higher interest rates correspond with higher monthly payments.
Besides this, the 12 Month Term and the ProsperScore of 10 had a very strong relationship with the success of the loan.
While the employment correlates with a better success chance and the purpose DeptRestructuring also had a better chance of success this was partly reversed when combining the two features. DeptRestructuring lead to an even better success chance for the employed, but lowered the chance for the unemployed.
## # A tibble: 2 × 3
## IsEmployed success_ratio n
## <fctr> <dbl> <int>
## 1 Employed 0.7110333 45417
## 2 Not employed 0.5974726 9654
## # A tibble: 2 × 3
## IsDeptRestructuring success_ratio n
## <lgl> <dbl> <int>
## 1 FALSE 0.6692016 37213
## 2 TRUE 0.7368126 17858
## Source: local data frame [4 x 4]
## Groups: IsEmployed [?]
##
## IsEmployed IsDeptRestructuring success_ratio n
## <fctr> <lgl> <dbl> <int>
## 1 Employed FALSE 0.6916209 28225
## 2 Employed TRUE 0.7429037 17192
## 3 Not employed FALSE 0.5987984 8988
## 4 Not employed TRUE 0.5795796 666
A prosper score and a higher monthly income correlate with a higher success chance. Combined the income is higher at ProsperScore 10. Even so at lower scores the difference in income for the successful and unsuccessful is bigger. Furthermore, there isn’t a big difference in lower ProsperScores.
The first plot shows how the BorrowerRate, the ProsperScore and the success of the loan correspond. The higher chance of success is easily spotted as well at the lower interest rate borrowers had to pay with a higher prosper score.
The second plot shows the strong effect of the employment of the borrower and how this distribution varies for different listing categories. The plot shows that Debt Restructuring was mostly stated by employed, while unemployed often also did not specify a listing category.
The third plot shows how the successful and not successful loans are connected through the numbers of investors. More investors mean a better chance of success.
I was able to look at different features in the dataset. Many features are significant for the determination if the loan will be payed back or not but mostly the effect strength is small. Some features have a very high influence but the feature is only limited to few instances and therefore often not of great help. While loans with a term of 12 month were very successful, only few loans were in this subgroup.
The problem of determining whether a borrower will pay back it’s loan or not remains difficult. The most widely available indicator for a success was if the borrower is employed.
Problematic with the dataset was also that most personal data are statements by the borrowers. They might made mistakes or lied.
Furthermore, some of the features I analyzed are only interesting in hindsight, since a typical question would be to determine the credit worthiness of a new borrower to determine the adequate interest rate.